BPT-PLR: A Balanced Partitioning and Training Framework with Pseudo-Label Relaxed Contrastive Loss for Noisy Label Learning

While collecting training data, even with the manual verification of experts from crowdsourcing platforms, eliminating incorrect annotations (noisy labels) completely is difficult and expensive. In dealing with datasets that contain noisy labels, over-parameterized deep neural networks (DNNs) tend to overfit, leading to poor generalization and classification performance. As a result, noisy label learning (NLL) has received significant attention in recent years. Existing research shows that although DNNs eventually fit all training data, they first prioritize fitting clean samples, then gradually overfit to noisy samples. Mainstream methods utilize this characteristic to divide training data but face two issues: class imbalance in the segmented data subsets and the optimization conflict between unsupervised contrastive representation learning and supervised learning. To address these issues, we propose a Balanced Partitioning and Training framework with Pseudo-Label Relaxed contrastive loss called BPT-PLR, which includes two crucial processes: a balanced partitioning process with a two-dimensional Gaussian mixture model (BP-GMM) and a semi-supervised oversampling training process with a pseudo-label relaxed contrastive loss (SSO-PLR). The former utilizes both semantic feature information and model prediction results to identify noisy labels, introducing a balancing strategy to maintain class balance in the divided subsets as much as possible. The latter adopts the latest pseudo-label relaxed contrastive loss to replace unsupervised contrastive loss, reducing optimization conflicts between semi-supervised and unsupervised contrastive losses to improve performance. We validate the effectiveness of BPT-PLR on four benchmark datasets in the NLL field: CIFAR-10/100, Animal-10N, and Clothing1M. Extensive experiments comparing with state-of-the-art methods demonstrate that BPT-PLR can achieve optimal or near-optimal performance.


Introduction
Large-scale, accurately labeled image data are one of the key prerequisites for the success of deep neural networks (DNNs) in numerous computer vision (CV) tasks, such as image captioning [1], image classification [2], and segmentation [3,4]. However, collecting such large-scale, high-quality annotated datasets requires significant manpower and resources. The current data collection process mainly involves scraping data from search engines, forums, and other websites and then relying on a large number of annotation experts on crowdsourcing platforms (e.g., Amazon Mechanical Turk) to cross-check and correct the tags. This process is time-consuming and becomes more challenging as the dataset size increases, leaving partially inaccurate annotations (noisy labels) even after verification. A wealth of research has shown that over-parameterized DNNs attempt to fit the labels of all samples, including noisy labels, which severely compromises their generalization performance. Therefore, existing research focuses on collecting data without relying on manual annotation and on helping DNNs learn from noisy datasets. The aim is to prevent overfitting to noisy samples while maintaining performance close to that achieved when learning from clean datasets; this line of work is known as noisy label learning (NLL).
Entropy 2024, 26, 589
Existing research indicates that although DNNs eventually fit all samples, they initially fit the predominant clean-label samples in the dataset and only gradually overfit the noisy-labeled samples [2,5]. Because of this memorization characteristic, clean samples have smaller losses in the early stage of training while noisy samples exhibit larger losses; this observation is termed the small-loss criterion [5] and is widely employed in methods that learn from noisy labels. Centered around the memorization characteristic of DNNs and the small-loss criterion, existing NLL methods can be categorized into three types: robust training loss, label correction, and sample selection. The first two are introduced in the next section. Because recent sample selection methods deliver superior performance, research in this direction is particularly important, and our method also belongs to it. Early sample selection techniques apply the small-loss criterion directly, selecting samples with smaller cross-entropy losses during training as a clean-label subset for supervised training. However, these methods perform worse than other types of methods because they use the training data inadequately. With further advances in NLL research, some sample selection methods have begun to use loss distribution estimators (e.g., the Gaussian mixture model (GMM) or the beta mixture model) to partition the training data based on the small-loss criterion, retaining the observed labels of samples with smaller losses (labeled samples) and discarding the labels of samples with larger losses (unlabeled samples). Subsequently, semi-supervised learning (SSL) and contrastive representation learning (CRL) techniques are introduced to train on the partitioned subsets and improve performance. Existing sample selection methods based on SSL techniques mostly derive from DivideMix, differing significantly in their data partitioning techniques and in the semi-supervised training strategies aimed at enhancing model robustness. Although these methods have achieved certain results, their performance still has room for improvement because of class imbalance in the partitioned subsets and optimization conflicts between contrastive representation losses and supervised losses. PLReMix has addressed some of these issues, but challenges remain. Compared with the original DivideMix, PLReMix primarily introduces a dual-component GMM based on sample semantic and category information for data partitioning during the sample selection process. Its robust training process then integrates the new pseudo-label relaxed contrastive loss (PLR) with existing SSL techniques. According to our analysis, this method faces two main issues: (1) During actual training, it is challenging for the model to completely avoid the influence of noisy labels in the early stages, resulting in many clean samples being mislabeled as noisy, especially in high-noise scenarios, where PLReMix tends to generate a large number of false positives, as depicted in Figure 3 for the 90% symmetric noise and the 40% and 49% asymmetric noise scenarios. (2) In the SSL training process, the number of model iterations per epoch depends on the size of the current labeled set. In high-noise scenarios, however, the number of labeled samples is much smaller than the number of unlabeled ones (as shown in Figure 5), which prevents the model from fully learning the data distribution and thereby limits performance improvements.
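The loss-based partitioning described above can be illustrated with a minimal sketch. It assumes a plain two-component one-dimensional Gaussian mixture fitted by EM on per-sample losses; the function names, the 0.5 threshold, and the toy loss values are hypothetical and are not taken from any of the cited implementations.

```python
import math

def responsibilities(v, means, variances, weights):
    """Posterior probability of each of the two components for one loss value."""
    dens = [weights[k] * math.exp(-(v - means[k]) ** 2 / (2.0 * variances[k]))
            / math.sqrt(2.0 * math.pi * variances[k]) for k in range(2)]
    total = sum(dens)
    if total == 0.0:                      # both densities underflowed
        return [0.5, 0.5]
    return [d / total for d in dens]

def fit_two_gaussians(values, iters=50):
    """EM for a two-component 1-D Gaussian mixture."""
    lo, hi = min(values), max(values)
    means = [lo, hi]                      # start one component at each extreme
    variances = [max((hi - lo) ** 2 / 4.0, 1e-6)] * 2
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: soft assignment of every loss value
        resp = [responsibilities(v, means, variances, weights) for v in values]
        # M-step: re-estimate mean, variance, and weight of each component
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)  # floor keeps the density finite
            weights[k] = nk / len(values)
    return means, variances, weights

def partition_by_small_loss(losses, threshold=0.5):
    """Split sample indices into labeled (likely clean) and unlabeled sets."""
    means, variances, weights = fit_two_gaussians(losses)
    clean = 0 if means[0] <= means[1] else 1   # small-loss criterion
    posteriors = [responsibilities(v, means, variances, weights)[clean]
                  for v in losses]
    labeled = [i for i, w in enumerate(posteriors) if w >= threshold]
    unlabeled = [i for i, w in enumerate(posteriors) if w < threshold]
    return labeled, unlabeled, posteriors

# Toy per-sample losses: five small (likely clean) and four large (likely noisy).
losses = [0.05, 0.10, 0.08, 0.12, 0.07, 2.1, 2.4, 1.9, 2.2]
labeled, unlabeled, posteriors = partition_by_small_loss(losses)
```

In this toy case the two loss clusters are well separated, so the split is unambiguous; with real training losses the clusters overlap, which is where the choice of threshold and the partitioning issues discussed above come into play.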
To address the issues in existing sample selection methods, we propose a framework named BPT-PLR (Balanced Partitioning and Training framework with Pseudo-Label Relaxed contrastive loss). This framework follows the structural design of existing sample selection methods based on SSL techniques, such as DivideMix [6], PLReMix [7], LongReMix [8], and C2MT [9], but introduces two key processes: a balanced partitioning process with a two-dimensional Gaussian mixture model (BP-GMM) and a semi-supervised oversampling training process with a pseudo-label relaxed contrastive loss (SSO-PLR).
Similar to PLReMix, our method employs a dual-component GMM during the BP-GMM process to model both the semantic and class information of samples. However, as shown in Figure 3, the divided labeled set is not entirely reliable. Furthermore, to mitigate the impact of class imbalance on model performance, we adopt a class-level balanced selection strategy to ensure that the number of samples in each class of the filtered labeled subset is as close as possible. Additionally, while CRL can enable the model to learn intrinsic semantic information of data independent of noisy labels, aiding in selecting samples containing noisy labels, it conflicts with the supervised loss (e.g., CE) when cooperating with SSL techniques. Therefore, the SSO-PLR process combines PLR with SSL techniques, obtaining more reliable negative pairs by checking whether the top K indices of prediction probabilities between different samples have an empty intersection. This preserves resistance to noisy labels and avoids conflicts with the supervised loss. As mentioned above, the number of labeled samples is much smaller than the number of unlabeled ones. Therefore, we introduce oversampling techniques to overcome the problem of existing sample selection methods failing to fully exploit feature information from unlabeled samples during the SSL process. We validate the effectiveness of BPT-PLR on four benchmark datasets in the NLL domain, and extensive experiments demonstrate that, compared with state-of-the-art (SOTA) methods, BPT-PLR can achieve similar or better test performance. The source code is available at https://github.com/LanXiaoPang613/BPT-PLR (accessed on 5 July 2024). Our main contributions are as follows:
1. We propose an improved end-to-end training framework called BPT-PLR (Balanced Partitioning and Training framework with Pseudo-Label Relaxed contrastive loss) to address issues of noisy label learning (NLL) in DNNs, such as class imbalance in partitioned subsets and optimization conflicts between CRL losses and supervised losses. This framework enhances DNNs' robustness to noisy labels and achieves superior performance.
2. We introduce a novel class-level balanced selection method based on a two-dimensional Gaussian mixture model (GMM). This method first models both the semantic and class information of the data using a two-dimensional GMM and then utilizes a class-level balanced selection strategy based on the distribution of samples to partition the data. This ensures that the labeled subset after partitioning maintains class balance, thereby alleviating the impact of the long-tail issue on model accuracy.
3. We incorporate the existing PLR loss into a semi-supervised learning (SSL) framework following previous work but further leverage it through oversampling techniques. This process enhances the model's learning of semantic information from both labeled and unlabeled samples, thereby improving test performance.
4. We demonstrate the effectiveness of BPT-PLR through extensive experiments on several classic datasets in the NLL field. Additionally, we validate the robustness of the two key processes proposed through ablation experiments.
The structure of this paper is outlined as follows: In Section 2, we introduce some existing research relevant to the method proposed in this paper. Section 3 is dedicated to introducing our method, while Section 4 provides a detailed explanation of the experiments and comparisons. Finally, we conclude in Section 6.

Related Works
This section mainly introduces recent research in the fields of noisy label learning (NLL) and contrastive representation learning (CRL).

Recent Research on NLL
Robust training loss. Because the cross-entropy (CE) loss widely used in classification tasks makes DNNs prone to overfitting noisy labels and thus to poor generalization, many studies deliberately design losses that are insensitive to noisy labels, and tend to underfit them, as substitutes for CE during training. Since Natarajan et al. [10] proved that a loss function satisfying the symmetry condition is robust to label noise, many studies have built on this result. For instance, Zhang et al. [11] demonstrated that although the Mean Absolute Error (MAE) is robust to noisy labels under the symmetry condition, this robustness can increase training difficulty and decrease model performance. Therefore, they combined CE with MAE to propose the Generalized Cross Entropy (GCE) loss, which offers both the rapid convergence of CE and the robustness of MAE to noisy labels. Similarly, inspired by [12], Qaraei et al. [13] proposed a convex surrogate of the unbiased 0-1 loss for content recommendation and multimedia search tasks, which typically encounter issues of class imbalance and missing labels [14]. Additionally, inspired by the symmetric Kullback-Leibler (KL) divergence, Wang et al. [15] introduced the Symmetric Cross-Entropy (SCE) loss and theoretically demonstrated its robustness to noisy labels under certain conditions. Zhang et al. [16] proposed Mixup, which interpolates between any two samples with a weight drawn from a beta distribution and then computes the CE loss for the interpolated sample. This method has been widely adopted in the field of NLL. Recently, Ye et al. [17] integrated active loss functions with strategies such as complementary label learning to devise a normalized negative loss function [18], which replaces the MAE loss used in the active-passive loss. This approach enables the model to focus more on learning clean samples. Additionally, Jain et al. presented a propensity-scored loss for extreme multi-label learning, which is useful for addressing tagging tasks and has the potential to be extended to pseudo-label generation in NLL research. However, because these functions are designed to underfit noisy labels, they also underfit a portion of clean samples that are difficult to distinguish, resulting in poor performance.
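As an illustration of how GCE sits between CE and MAE, the sketch below uses the standard per-sample form (1 − p_y^q)/q with q in (0, 1]. The value q = 0.7 and the helper names are illustrative assumptions, not settings taken from this paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gce_loss(probs, label, q=0.7):
    """Generalized Cross Entropy for one sample: (1 - p_y^q) / q, 0 < q <= 1."""
    return (1.0 - probs[label] ** q) / q

probs = softmax([2.0, 0.5, -1.0])
ce = -math.log(probs[0])                   # cross-entropy for label 0
mae_like = 1.0 - probs[0]                  # MAE up to a constant factor
near_ce = gce_loss(probs, 0, q=1e-6)       # q -> 0 recovers CE
at_one = gce_loss(probs, 0, q=1.0)         # q = 1 gives the MAE-like form
mid = gce_loss(probs, 0, q=0.7)            # in between for intermediate q
```

Smaller q behaves more like CE (faster fitting, less robust), while q close to 1 behaves more like MAE (more robust, slower to fit), which is the trade-off described in the paragraph above.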
Label correction. The label correction process primarily leverages the memorization characteristic of DNNs: after a certain period of pre-training, model predictions replace the observed labels of samples to alleviate the impact of noisy labels on model performance. The joint optimization framework [19] directly uses model predictions to replace the original labels, which cannot update each sample finely and leads to fluctuations in model performance. Yi et al. [20] proposed the PENCIL framework, which continuously corrects labels based on the gradients generated when each sample participates in loss computation and backpropagation, thus alleviating the fluctuation. Building upon this, Zhang et al. [2] introduced Mixup [16] and balance terms to further enhance the label correction capability, and ref. [21] proposed a novel label correction framework for feature-dependent label noise. Additionally, Xu et al. [22] introduced a contrastive prototypical loss to maximize the distance between the class cluster and the data point and assist in the label correction process. Similarly, Huang et al. [23] employed supervised contrastive learning techniques to guide the label correction process, achieving certain improvements. Wang et al. [24] proposed an end-to-end dynamic correction method for NLL, which utilizes the knowledge from past epochs to combat label noise. However, these methods exhibit performance fluctuations when faced with real-world datasets, which casts doubt on their practical utility.
Sample selection. Early sample selection methods train only on samples with smaller losses to mitigate the impact of noisy labels. For example, Co-teaching [25] employs two networks to alternately select small-loss samples for training, while CJCnet [26] eliminates noisy labels through cross-training and learning rate oscillation strategies. As research progressed, DivideMix [6] and ELR [27] pioneered the combination of SSL techniques with sample selection methods to fully utilize the information carried by both clean and noisy samples, achieving significant progress. Subsequently, Karim et al. [28] introduced unsupervised CRL and Jensen-Shannon divergence (JSD) into semi-supervised sample selection methods to further boost performance. Zhang et al. [29] proposed a new sample selection and weighting method called Hyper-spherical Margin Weighting (HMW) and embedded it into [28]. Feng et al. [30] applied optimal transport theory to the sample selection process. Li et al. [31] adopted different dynamic thresholds for selecting clean, challenging, and noisy samples, combined with semi-supervised learning techniques to improve performance. Additionally, Zhang et al. [9] further improved DivideMix by introducing cross-to-merge training strategies and median balance strategies. Cordeiro et al. [8] decomposed the sample selection and robust training processes of DivideMix into two steps for targeted optimization, achieving certain progress. Sun et al. [32] simplified the sample selection problem into a clustering problem and introduced twin contrastive clustering to resolve it. Deng et al. [33] proposed SLRLNL to separate noisy labels from hard yet clean samples to improve model robustness.

Recent Research on CRL
CRL is a representative self-supervised learning technique that can learn feature representations independent of labels. During training, positive and negative examples must be constructed from a batch of data to calculate the InfoNCE loss. SimCLR [34] uses two strong data augmentations of each sample as positives and treats the other samples as negatives when computing InfoNCE, and thus requires larger batch sizes. Meanwhile, MoCo [35] utilizes a momentum encoder and a queue to generate negatives for samples, reducing the required batch size. Additionally, Khosla et al. [36] extended self-supervised CRL to a fully supervised setting by leveraging label information, where samples of the same class are treated as positives and samples of different classes are treated as negatives. Li et al. [37] calculated the moving average of the low-dimensional embeddings of each class to obtain category prototypes and utilized these prototypes to perform CRL. Because CRL enables models to learn semantic information in the data independent of labels, it holds great potential for application in the NLL field. However, the labels of samples in noisy datasets are unreliable, resulting in fewer applications of supervised CRL [38]. Instead, many NLL methods introduce unsupervised CRL techniques to enhance the robustness of models to noisy labels [39]. However, ref. [7] demonstrated an optimization conflict between the contrastive loss computed using unsupervised CRL and the supervised loss computed using model outputs and observed labels. This conflict limits further improvement in test accuracy. Therefore, they define two samples as a reliable negative pair when the top K indices of their predicted probabilities have an empty intersection and use only these negative pairs to compute the contrastive loss, reducing the optimization conflict between the contrastive loss and the supervised loss. However, PLReMix [7] requires different variants of the PLR loss for different types of datasets (for example, Flat PLR for CIFAR [40] and native PLR for Clothing1M [41]), and the performance varies significantly. Although our method adopts the proposed PLR loss, we successfully overcome these challenges by introducing two key processes.
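The reliable-negative-pair rule described above can be sketched as follows. This is a minimal illustration of the empty top-K intersection test only; the function names, K = 2, and the toy probabilities are assumptions, and the sketch does not reproduce the full PLR loss of [7].

```python
def top_k_indices(probs, k):
    """Set of the k class indices with the highest predicted probability."""
    order = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    return set(order[:k])

def reliable_negative_mask(batch_probs, k=2):
    """mask[i][j] is True when samples i and j form a reliable negative pair,
    i.e. their top-k predicted class sets do not overlap."""
    tops = [top_k_indices(p, k) for p in batch_probs]
    n = len(tops)
    return [[i != j and tops[i].isdisjoint(tops[j]) for j in range(n)]
            for i in range(n)]

# Toy softmax outputs for three samples over four classes.
batch_probs = [
    [0.70, 0.20, 0.05, 0.05],   # top-2 classes: {0, 1}
    [0.60, 0.30, 0.05, 0.05],   # top-2 classes: {0, 1}
    [0.05, 0.05, 0.50, 0.40],   # top-2 classes: {2, 3}
]
mask = reliable_negative_mask(batch_probs, k=2)
```

Samples 0 and 1 share likely classes, so they are not pushed apart, whereas sample 2 is a reliable negative for both. A contrastive loss would then use only the pairs marked True as negatives.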

Algorithm
Common DNNs for classification tasks typically consist of a feature extractor $f(\cdot, \theta)$ and a classifier $h(\cdot, \varphi)$, where $\theta$ and $\varphi$ are the corresponding learnable parameters. The feature extractor generates high-dimensional features $z = f(x, \theta)$ for any input $x$, while the classifier produces model predictions $h(z, \varphi)$ based on $z$. Assume that a $k$-class classification network is trained on a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ containing $N$ samples, where $x_i \in \mathbb{R}^{H \times W}$ represents the $i$-th training instance and $y_i \in [k] = \{1, 2, \ldots, k\}$ is the corresponding ground-truth (GT) label. Most classification tasks are performed using the CE loss shown in Equation (1), minimizing $L_{ce}$ to fit the DNN to all given labels.
$$L_{ce} = -\frac{1}{N}\sum_{i=1}^{N} \log p(y_i \mid x_i) \tag{1}$$
where $p(y_i \mid x_i)$ is the $y_i$-th component of the prediction $p(x_i) = \mathrm{softmax}(h(f(x_i, \theta), \varphi))$ for the input $x_i$. However, when there are mislabeled samples in the dataset, i.e., $\tilde{y}_i \neq y_i$ (let $\tilde{D} = \{(x_i, \tilde{y}_i)\}_{i=1}^{N}$ and $\tilde{y}_i \in [k]$ denote the noisily labeled dataset and the noisy label), it has been shown from the perspective of gradient contribution [29,30] that samples with noisy labels carry greater weight than those with clean labels as convergence progresses, rendering this paradigm unreliable [27]. This motivates NLL.
An overview of the proposed framework is shown in Figure 1. Our framework is similar to existing sample selection methods with SSL techniques, employing two identical DNNs that are trained alternately. Like PLReMix [7], each DNN comprises a feature extractor $f(\cdot, \theta^{(m)})$ and a classifier $h(\cdot, \varphi^{(m)})$ for semi-supervised classification tasks, along with an additional projection head $g(\cdot, \phi^{(m)})$ that maps the high-dimensional features $z$ to a low-dimensional embedding $q$. Here, $\theta^{(m)}$, $\varphi^{(m)}$, and $\phi^{(m)}$ are the corresponding parameters, and $m \in \{0, 1\}$ denotes the network index. We pre-train both models using the CE loss. To address asymmetric noise scenarios, we introduce an additional penalty term [6,9,30,31] based on the prediction confidence to promote a more uniform loss distribution, which facilitates GMM modeling. This penalty term for the $m$-th model is given in Equation (2), where $p^{(m)}(x_i)$ is the softmax prediction of the $m$-th network for the input $x_i$. In the next two sections, we provide a detailed explanation of the two key processes discussed in this article, i.e., BP-GMM and SSO-PLR.
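The exact form of the penalty term is not reproduced here. As a hedged illustration, the sketch below assumes the negative-entropy penalty used during warm-up in DivideMix [6], one of the works cited for this term; whether BPT-PLR uses precisely this form is an assumption, and all function names are hypothetical.

```python
import math

def negative_entropy(probs, eps=1e-12):
    """Negative entropy sum_c p_c * log(p_c); minimising it flattens predictions."""
    return sum(p * math.log(p + eps) for p in probs)

def warmup_loss(probs, observed_label, use_penalty=True, eps=1e-12):
    """Per-sample warm-up loss: CE on the observed label plus the optional penalty."""
    ce = -math.log(probs[observed_label] + eps)
    return ce + (negative_entropy(probs, eps) if use_penalty else 0.0)

confident = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
pen_confident = negative_entropy(confident)   # close to 0
pen_uniform = negative_entropy(uniform)       # equals -log(4)
```

Because the penalty is larger for over-confident predictions, adding it to the CE loss discourages the network from confidently fitting every label during warm-up, which matches the stated goal of a more uniform loss distribution under asymmetric noise.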

Balanced Partitioning Process
After the warm-up stage, at the beginning of each epoch, we first divide the entire noisily labeled dataset $\tilde{D}$ into a labeled set $X^{(m)}$ and an unlabeled set $U^{(m)}$ through this process for each network $m \in \{0, 1\}$. In the labeled set $X^{(m)}$, the original label of each sample is considered to be nearly correct, so we retain its label; in the unlabeled set $U^{(m)}$, the original label of each sample is deemed incorrect, so we remove its label to alleviate model overfitting. To this end, we separately calculate the classification cross-entropy (CCE) loss and the prototype cross-entropy (PCE) loss of each sample under each of the two models and fit a two-component two-dimensional GMM to them. We use the GMM to estimate the posterior probability that a sample has a clean label. Figure 2 illustrates the BP-GMM process.
Assume that the two types of losses are computed with model $m$. The CCE is the per-sample (un-averaged) form of Equation (1), e.g., $L^{(m)}_{ce,i} = -\log p^{(m)}(\tilde{y}_i \mid x_i)$; it measures how well the network fits the sample labels and is consistent with Equation (1), except that the GT label $y_i$ is replaced by the observed label $\tilde{y}_i$. Modeling the CCE of each sample with a GMM fully utilizes the class information the samples carry. Furthermore, the PCE represents the semantic-level potential category probability distribution between the low-dimensional embedding $q_i^{(m)}$ of the sample $x_i$ and all class prototypes $\{Q_c^{(m)}\}_{c=1}^{k}$. Here, $Q_c^{(m)}$ is the prototype of the $c$-th class and is defined as the mean center of low-dimensional embeddings with the same semantic information. The initialization and update methods are detailed in Equations (18) and (21) of Section 3.3. Here, we assume that all class prototypes for the current epoch have been obtained. The PCE of instance $x_i$ is then given by Equation (3), in which $\tilde{\mathbf{y}}_i$ is the one-hot representation of the observed label $\tilde{y}_i$ and $d^{(m)}$ denotes the normalized cosine similarity matrix. The $c$-th component of $d_i^{(m)}$, denoted $d_{i,c}^{(m)}$, is calculated according to Equation (4); it represents the distance between the embedding $q_i^{(m)}$ and $Q_c^{(m)}$ and is adopted from [7,37].
In an ideal scenario, with a proficient feature extractor $f(\cdot, \theta^{(m)})$ and projection head $g(\cdot, \phi^{(m)})$, the embeddings of samples with similar semantic information should form a cluster, with the center of this cluster representing the corresponding class prototype. In such a case, if the given label $\tilde{y}_i$ of an instance pair $(x_i, \tilde{y}_i) \in \tilde{D}$ does not match its GT label $y_i$, its embedding lies far from the prototype of its observed class. Consequently, the prototype cross-entropy loss for this instance is greater than that of other instances with the same observed label whose observed label matches the true label. Therefore, the semantic information carried by the training data can also be fully employed by fitting the GMM to the PCE loss.
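To make the role of the PCE concrete, the sketch below computes a prototype-based cross-entropy for one embedding. It assumes that the per-class scores are a softmax over temperature-scaled cosine similarities between the embedding and the class prototypes; this form, the temperature of 0.3, and the function names are assumptions chosen for illustration and may differ from Equations (3) and (4).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def prototype_probs(embedding, prototypes, temperature=0.3):
    """Softmax over temperature-scaled cosine similarities to all prototypes."""
    sims = [cosine(embedding, proto) / temperature for proto in prototypes]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def pce_loss(embedding, prototypes, observed_label, temperature=0.3):
    """Cross-entropy between the one-hot observed label and the prototype scores."""
    return -math.log(prototype_probs(embedding, prototypes, temperature)[observed_label])

prototypes = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical prototypes of two classes
embedding = [0.9, 0.1]                     # embedding close to the class-0 prototype
loss_matching = pce_loss(embedding, prototypes, observed_label=0)
loss_mismatch = pce_loss(embedding, prototypes, observed_label=1)
```

An embedding that sits near the class-0 prototype yields a small loss when its observed label is 0 and a large loss when its observed label is 1, which is the separation the GMM exploits along the PCE axis.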
After obtaining these two types of losses based on model $m$, a two-component two-dimensional GMM is trained to fit the joint distribution of the per-sample (CCE, PCE) loss pairs.
Since samples with clean labels typically have smaller losses, it has been confirmed in the literature [6-9] that the mean of the loss distribution they form is closer to 0 than that of noisy samples. Therefore, following the small-loss criterion, after fitting the GMM we choose the component with the smaller mean and use the corresponding Gaussian to estimate the posterior probability that each sample has a clean label (Equation (5)). We denote this posterior probability for the $i$-th pair as $w_i^{(m)}$. The posterior probabilities of the samples are then sorted in descending order within each class (Equation (6)), where sort(·) is the sorting function in descending order and only the samples whose observed labels $\{\tilde{y}_i\}_{i=1}^{N}$ belong to category $c$ are sorted into the set of class $c$. Subsequently, we determine whether the posterior probability of each sample exceeds the predefined threshold $\tau_s \in [0, 1]$ and count the number of samples exceeding $\tau_s$ (Equation (7)), using the indicator function $\mathbb{1}(\cdot)$, which returns 1 only when the condition (e.g., $w_i^{(m)} \geq \tau_s$) is met. We then perform sample selection at the class level: based on Equations (6) and (7), the samples with the highest posterior probabilities in each class are selected for the labeled set $X^{(m)}$:
$$X^{(m)} = \bigcup_{c=1}^{k} R_c^{(m)}$$
Here, $R_c^{(m)}$ represents the selected labeled samples within the $c$-th class. The unlabeled set $U^{(m)}$ consists of the remaining samples, with their labels removed:
$$U^{(m)} = \tilde{D} \setminus X^{(m)}$$
Previous methods [6-9] that used a 1d-GMM or 2d-GMM to estimate posterior probabilities for sample partitioning have overlooked the class imbalance in the labeled subsets after the selection process. We propose a method called BP-GMM, which combines a balancing partition mechanism with a 2d-GMM to address this issue. In Figure 3, we present the number of true positive (TP) and false positive (FP) samples within each class of the labeled subsets partitioned using BP-GMM and several representative methods (PLReMix [7], UNICON [28], and LongReMix [8]). From Figure 3, it is evident that BP-GMM not only maintains class balance in the partitioned labeled subsets but also increases the number of true positive samples in each class. Although UNICON also addresses class imbalance, resulting in balanced samples after selection, its use of Jensen-Shannon divergence (JSD) to partition based solely on class information neglects semantic information. As a result, in its selected labeled subset, the number of TP samples is significantly lower than that obtained using our method, except for the 50%-sym. scenario, where the results of the two methods are close.
After partitioning the labeled subsets and the unlabeled subsets based on two models using the BP-GMM process, as illustrated in Figure 1a, the two subsets divided by the $m$-th model are utilized in the SSL training of the $(1-m)$-th model in the SSO-PLR process. Similarly, the $m$-th model employs the two subsets divided by the $(1-m)$-th model for SSL training. Through this co-teaching strategy, the accumulation of error flows for each model is significantly alleviated [6,25,26]. The following section explains the SSO-PLR process and the initialization and updating methods of class prototypes.
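The class-level balanced selection can be sketched as follows, taking the clean-label posteriors as given. The per-class quota used here (the number of samples above the threshold divided by the number of classes) is an assumption made for illustration, since the exact selection rule is not reproduced above; the function name and toy values are likewise hypothetical.

```python
def balanced_partition(posteriors, observed_labels, num_classes, tau=0.5):
    """Class-balanced split of sample indices into labeled and unlabeled sets."""
    # Count the samples whose clean-label posterior reaches the threshold.
    n_confident = sum(1 for w in posteriors if w >= tau)
    # ASSUMPTION: every class receives the same quota of labeled samples.
    quota = n_confident // num_classes
    labeled = []
    for c in range(num_classes):
        members = [i for i, y in enumerate(observed_labels) if y == c]
        # Sort the samples of class c by posterior in descending order.
        members.sort(key=lambda i: posteriors[i], reverse=True)
        labeled.extend(members[:quota])
    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(posteriors)) if i not in labeled_set]
    return sorted(labeled), unlabeled

posteriors = [0.90, 0.80, 0.70, 0.60, 0.95, 0.20, 0.10, 0.30]
observed_labels = [0, 0, 0, 0, 1, 1, 1, 1]
labeled, unlabeled = balanced_partition(posteriors, observed_labels, num_classes=2)
threshold_only = [i for i, w in enumerate(posteriors) if w >= 0.5]
```

In this toy case, plain thresholding keeps four samples of class 0 and one of class 1, while the balanced rule keeps two per class. Under the assumed quota, balance can come at the cost of admitting a lower-posterior sample (index 7), so the quota is a design choice worth tuning.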

Figure 3. TP and FP samples within each class of the labeled subsets partitioned by LongReMix [8], PLReMix [7], UNICON [28], and our method, respectively. (b) Comparison using the CIFAR-10 dataset with 40% asymmetric noisy labels. (c) Comparison using the CIFAR-10 dataset with 90% symmetric noisy labels. (d) Comparison using the CIFAR-10 dataset with 49% asymmetric noisy labels.

Semi-Supervised Oversampling Training Process
In this section, we illustrate the details of the SSO-PLR process. As shown in Figure 1a, we alternately train two models. Assuming the current training is for the m-th network, the two subsets that it uses, $X^{(1-m)}$ and $U^{(1-m)}$, are derived from the partition results of the (1−m)-th network. As illustrated in Figure 4, we employ an SSL framework similar to previous sample selection methods [6-9], but with the addition of an oversampling strategy and PLR loss to further enhance the robustness and classification performance of the networks.

SSL loss. Taking the labeled subset $X^{(1-m)}$ as the primary sampling target, we first sample a mini-batch $B_l^t = \{(x_i, y_i, w_i)\}$ of size $b$ from it and a mini-batch $B_{ul}^t$ of the same size from the unlabeled subset $U^{(1-m)}$. Here, $t$ represents the count of batch sampling from the labeled set $X^{(1-m)}$ in the current epoch $e$. As depicted in Figure 2, although the majority of samples in the labeled subset are clean, it unavoidably contains some instances with noisy labels, so its labels are not fully reliable. Additionally, the original labels of samples in the unlabeled subset are unreliable. Hence, after applying weak augmentation $\mathrm{wk}(\cdot)$ twice to each input $x_i$ in the two mini-batches, we generate pseudo-labels $\hat{y}_i$ for them using Equation (10).
Here, $\tilde{y}_i$ is the one-hot representation of $y_i$ and $\mathrm{sp}(\cdot)$ is a sharpen function used in previous works. In this sharpen function, $y_{ij}$ is the $j$-th component of the soft label $y_i$ and $T$ is the sharpening coefficient, preset to 0.5. Next, we apply two rounds of strong augmentation (i.e., $\mathrm{stg}(\cdot)$) to each input $x_i$ from $B_l^t$ and $B_{ul}^t$, respectively, and concatenate the augmented results in sequence to form two new batches, $B_{l,\mathrm{stg}}^t = \{(x_i^{\mathrm{stg}}, \hat{y}_i)\}$ and $B_{ul,\mathrm{stg}}^t$. We then apply the Mixup [16] operation of Equation (12) to each pair from the union of $B_{l,\mathrm{stg}}^t$ and $B_{ul,\mathrm{stg}}^t$ to improve the models' generalization and robustness. The Mixup result for the input pair $(x_i^{\mathrm{stg}}, \hat{y}_i)$ is denoted as $(x'_i, \hat{y}'_i)$, where $\lambda$ is a dynamic value randomly sampled from the beta distribution Beta($\beta$) with a predefined factor $\beta$, and $j$ represents a random permutation of the indices in $B_{l,\mathrm{stg}}^t$ and $B_{ul,\mathrm{stg}}^t$. We denote the results of Equation (12) for the two batches $B_{l,\mathrm{stg}}^t$ and $B_{ul,\mathrm{stg}}^t$ as $B_{l,\mathrm{mix}}^t$ and $B_{ul,\mathrm{mix}}^t$, respectively. Subsequently, we compute the semi-supervised loss for each pair from $B_{l,\mathrm{mix}}^t$ and $B_{ul,\mathrm{mix}}^t$ according to Equation (13).

PLR loss. Currently, some sample selection methods not only employ Equation (13) to train networks but also utilize additional unsupervised CRL techniques to learn from each pair in $B_{ul}^t$. In unsupervised CRL, the two transformations of each sample are treated as a positive pair, while the transformations of all other samples in the batch serve as negatives. By leveraging InfoNCE [34,35], the similarity of positive embeddings is enhanced, while that of negative embeddings is diminished. However, some negative embeddings in the batch may share the GT label of the positive embeddings. In such cases, unsupervised CRL attempts to increase the distance between these negative embeddings and the positive embeddings, which conflicts with the optimization goal in Equation (13). Ref. [7] has demonstrated that this conflict significantly affects network performance. Therefore, this paper introduces PLR loss to help the feature extractor better learn from unlabeled information without disturbing the classifier. For the input $x'_i$, following the method described in reference [7], we construct a reliable negative sample set $O_i^t$ (defined in Equation (14)) by removing instances from the union of $B_{l,\mathrm{mix}}^t$ and $B_{ul,\mathrm{mix}}^t$ that have the same potential GT labels as $x'_i$. Given the model's prediction $p^{(m)}(x_i)$ (without augmentation or Mixup) for the sample $x_i$ from which $x'_i$ was produced (both share the same index in the dataset $D$), we first determine the top $n$ indices with the highest prediction probabilities in $p^{(m)}(x_i)$ and combine them with the observed label $y_i$, denoted $\mathrm{top}_i^n \cup y_i$. Then, we select instances from the union of $B_{l,\mathrm{mix}}^t$ and $B_{ul,\mathrm{mix}}^t$ for inclusion in $O_i^t$ according to Equation (14).
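To make the construction of the reliable negative set more concrete, the following minimal Python sketch builds a candidate label set for every sample in a batch (its top-n predicted classes united with its observed label) and keeps as negatives only those samples whose candidate sets are disjoint from the anchor's. It then evaluates an InfoNCE-style loss restricted to those negatives. The disjointness rule, the temperature `tau`, and all function names are assumptions made for illustration and are not taken from Equation (14) or from the PLR loss formula; plain Python lists stand in for tensors.

```python
import math


def candidate_labels(probs, observed_label, n):
    """Top-n predicted classes of one sample, united with its observed label."""
    ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    return set(ranked[:n]) | {observed_label}


def reliable_negatives(all_probs, observed_labels, n):
    """For each anchor i, list the indices k whose candidate sets share no class with i.

    Assumption: two samples "have the same potential GT label" when their
    candidate label sets intersect.
    """
    cands = [candidate_labels(p, y, n) for p, y in zip(all_probs, observed_labels)]
    return [
        [k for k, ck in enumerate(cands) if k != i and not (ci & ck)]
        for i, ci in enumerate(cands)
    ]


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def relaxed_info_nce(z1, z2, negatives, tau=0.5):
    """InfoNCE over two views with negatives restricted to the reliable sets.

    z1, z2: lists of L2-normalised embeddings, one list per view.
    tau: temperature (an assumed value, not given in the text).
    """
    total = 0.0
    for i, neg in enumerate(negatives):
        pos = math.exp(dot(z1[i], z2[i]) / tau)
        neg_sum = sum(math.exp(dot(z1[i], z2[k]) / tau) for k in neg)
        total += -math.log(pos / (pos + neg_sum))
    return total / len(negatives)
```

With n = 1, two samples that are both predicted as class 0 never act as negatives for each other, even if one of them carries a different (possibly noisy) observed label, which is the relaxation the paragraph above describes.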
Subsequently, we compute the PLR loss using vanilla InfoNCE based on this negative sample set. The final optimization objective of SSO-PLR combines the SSL loss of Equation (13) with this PLR loss.

Oversampling. Existing semi-supervised sample selection methods sample a labeled mini-batch from $X^{(1-m)}$ with a predefined batch size $b$ during training, then sample an unlabeled mini-batch of the same size from $U^{(1-m)}$. After the Mixup operation, the CRL loss (if any) and the SSL loss are computed and backpropagated. Training of the current epoch $e$ ends immediately when all labeled samples have been sampled, i.e., when $t \triangleq |X^{(1-m)}|/b$. However, in most noisy scenarios, the unlabeled subset $U^{(1-m)}$ produced by sample selection methods is much larger than the labeled subset $X^{(1-m)}$ (as shown in Figure 5), so training is interrupted before many unlabeled samples are sampled, causing the model to miss the opportunity to learn from a large amount of unlabeled sample information [8].

Consequently, as shown in Figure 4, we introduce an oversampling mechanism. If the current sampling count $t$ has not reached the maximum sampling times of the original dataset, $(|X| + |U|)/b$, we continue sampling from the labeled and unlabeled sets, and training for the current epoch stops only when all unlabeled samples have been trained on at least once. Resampling the clean data subset not only allows the model to learn the information carried by all unlabeled samples but also reduces the bound for vicinal risk minimization (as shown in Theorem 8 of [42] and Section 5.4 of [8]), thereby enhancing the accuracy and robustness of semi-supervised classification.
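The effect of the oversampling rule on one epoch can be illustrated with a small, self-contained Python sketch. It compares the conventional stopping point, $|X|/b$ batches, with the oversampled one, $(|X| + |U|)/b$ batches, and measures which fraction of the unlabeled subset has been visited. The helper names, the reshuffle-and-cycle sampling scheme, and the rounding up of the batch count are assumptions for illustration rather than details from the paper.

```python
import math
import random


def batches_per_epoch(num_labeled, num_unlabeled, batch_size, oversample=True):
    """Number of sampling steps t in one epoch, with or without oversampling."""
    total = num_labeled + num_unlabeled if oversample else num_labeled
    return math.ceil(total / batch_size)


def cycling_batches(indices, batch_size, rng):
    """Yield mini-batches forever, reshuffling each time the subset is exhausted."""
    while True:
        order = list(indices)
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            yield order[start:start + batch_size]


def unlabeled_coverage(labeled, unlabeled, batch_size, oversample, seed=0):
    """Fraction of unlabeled samples visited at least once during one epoch."""
    rng = random.Random(seed)
    labeled_iter = cycling_batches(labeled, batch_size, rng)
    unlabeled_iter = cycling_batches(unlabeled, batch_size, rng)
    seen = set()
    steps = batches_per_epoch(len(labeled), len(unlabeled), batch_size, oversample)
    for _ in range(steps):
        next(labeled_iter)                 # labeled batch, revisited across passes
        seen.update(next(unlabeled_iter))  # unlabeled batch of the same size
    return len(seen) / len(unlabeled)
```

For a toy split of 20 labeled and 100 unlabeled samples with b = 10, the conventional schedule stops after 2 steps, whereas the oversampled schedule runs for 12 steps, enough to visit every unlabeled sample while revisiting the labeled ones several times.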
Figure 5. As the noise rate increases (e.g., from CIFAR-10 with 40%-asym. to 80%-sym., and from CIFAR-100 with 40%-asym. to 50%-sym.), the quantity of unlabeled samples significantly surpasses that of labeled samples. This tendency becomes more pronounced as the number of categories increases (e.g., from CIFAR-10 with 80%-sym. to CIFAR-100 with 50%-sym.).

Calculation of Class Prototypes
We maintain a class prototype set for each model $m \in \{0, 1\}$. Let us assume the current model is the m-th one. After a certain number of training iterations (usually more than 10 epochs), this network has preliminarily converged and demonstrates basic classification performance. As class prototypes are defined as the mean centers of low-dimensional embeddings with the same semantic information, these embeddings form clusters around their corresponding class prototypes. Following prior research [7,37], at the end of the warm-up, we partition all samples from the whole noisy dataset $D$ into $k$ subsets $\{D_c\}_{c=1}^{k}$ based on their observed labels $\{y_i\}_{i=1}^{N}$, as shown in Equation (17). Then, for each subset $D_c$, the low-dimensional embeddings extracted by the feature extractor $f(\cdot; \theta^{(m)})$ and the projection head $g(\cdot; \phi^{(m)})$ are accumulated and averaged to form the corresponding class prototype $Q_c^{(m)}$, as expressed in Equation (18).

At the end of each epoch after warm-up, we update all class prototypes $\{Q_c^{(m)}\}_{c=1}^{k}$ using the momentum updating method. First, we utilize the predictions $p^{(m)}(x_i)$ of the model $m$ and the similarity $d_i^{(m)}$ measured by Equation (4) to estimate the latent GT labels via Equation (19), where $\delta_i$ is the estimated latent label for the input $x_i$ and $\alpha = 0.5$ is a predefined coefficient that controls the contribution of the predictions to the estimated labels. This process aims to maximize the utilization of information from both the label and feature spaces. Subsequently, we utilize Equation (20) to determine the true classes of samples and select high-confidence samples to update class prototypes, aiming to further mitigate the impact of noisy samples on class prototype updates. Here, $\upsilon = 0.8$ is a fixed threshold for performing label correction. Consequently, we update the class prototypes $Q^{(m)}_{y_i^{\mathrm{pro}}}$ using the embedding $q_i^{(m)}$ and the estimated hard label $y_i^{\mathrm{pro}}$, as shown in Equation (21), where $\varsigma = 0.99$ is the momentum coefficient, Norm(·) represents the normalization function, and mean(·) is the mean function.

It should be noted that momentum updates are not performed during the warm-up stage; the class prototypes of each network are only initialized according to Equation (18) after warm-up. After the SSO-PLR training process of each network, we sequentially execute the momentum update process for class prototypes as described in Equation (21).
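The prototype life cycle described in this subsection (initialization from observed labels, confidence-filtered label estimation, and momentum update) can be sketched in a few lines of plain Python. The sketch assumes that the estimated label score is the convex combination `alpha * prediction + (1 - alpha) * similarity`, that a sample is kept only when its highest score exceeds the threshold, and that the update is `Norm(momentum * prototype + (1 - momentum) * mean(embeddings))`. These forms are consistent with the symbols named in the text ($\alpha$, $\upsilon$, $\varsigma$, Norm, mean), but they are assumptions, not a reproduction of Equations (18)-(21); all function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def init_prototypes(embeddings, observed_labels, num_classes):
    """One prototype per class: the mean embedding of samples with that observed label.

    Assumes every class has at least one sample.
    """
    return [
        mean_vector([q for q, y in zip(embeddings, observed_labels) if y == c])
        for c in range(num_classes)
    ]


def estimate_label(pred, sim, alpha=0.5, threshold=0.8):
    """Blend class predictions with prototype similarities; keep confident samples only."""
    delta = [alpha * p + (1 - alpha) * d for p, d in zip(pred, sim)]
    best = max(range(len(delta)), key=lambda c: delta[c])
    return best if delta[best] > threshold else None  # None: sample is skipped


def momentum_update(prototypes, embeddings, est_labels, momentum=0.99):
    """Move each prototype slightly toward the mean embedding of its confident samples."""
    updated = [p[:] for p in prototypes]
    for c, proto in enumerate(prototypes):
        members = [q for q, y in zip(embeddings, est_labels) if y == c]
        if not members:
            continue  # no confident sample for this class in this round
        batch_mean = mean_vector(members)
        mixed = [momentum * a + (1 - momentum) * b for a, b in zip(proto, batch_mean)]
        updated[c] = normalize(mixed)
    return updated
```

Because the momentum coefficient is close to 1, a single epoch of possibly mislabeled high-confidence samples shifts each prototype only slightly, which matches the stated goal of limiting the influence of noisy samples on the prototypes.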

Pseudo-Code
The pseudo-code of our method is illustrated in Algorithm 1, and an overall framework is shown in Figure 1.

    // execute the BP-GMM process using Equations (3)-(9)
    // for network m = 0
    perform coarse data division using two-dimensional GMM (Equations (3) and (4)); // the same as PLReMix
    perform the proposed class-level balanced selection on the coarse division results using Equations (5)-(9); // different from PLReMix
    generate labeled subset $X^{(0)}$ and unlabeled subset $U^{(0)}$;
    // for network m = 1
    perform coarse data division using two-dimensional GMM (Equations (3) and (4)); // the same as PLReMix
    perform the proposed class-level balanced selection on the coarse division results using Equations (5)-(9); // different from PLReMix
    generate labeled subset $X^{(1)}$ and unlabeled subset $U^{(1)}$;
    // execute the SSO-PLR process
    for network m = 0 to 1: // oversampling strategy, different from PLReMix
        sample a labeled mini-batch $B_l^t$ and an unlabeled mini-batch $B_{ul}^t$ from $X^{(1-m)}$ and $U^{(1-m)}$, respectively;
        perform label-refinement and co-guessing operation using Equation (10); // generate pseudo-labels for all samples
        apply Mixup augmentation to the two mini-batches using Equation (12); // enhance model generalization and robustness
        calculate the SSL loss and PLR loss through Equations (13) and (15);
        perform backpropagation according to Formula (14) to update all parameters of the current network;
        t++; // the increment of t
        end if // all the unlabeled samples are completely sampled
    end for
    // update all the class prototypes, the same as PLReMix
    for network m = 0 to 1:
        estimate latent GT labels based on the current network using Equations (19) and (20);
        perform momentum updates for the class prototypes belonging to the current network using Equation (21);
    end for
    e++; // the increment of epoch counter e
    end while
    Output: two robust networks m = 0, 1; two labeled subsets $X^{(m)}$ with relatively low noise rates.

Experiments

Datasets and Experimental Settings
Following previous research, such as [6-9,30,31], we validated the performance of our approach on two synthetic noisy datasets (i.e., CIFAR-10 and CIFAR-100) and two real-world noisy datasets (i.e., Animal-10N and Clothing1M). The experiments covered various noise scenarios, and both coarse-grained and fine-grained datasets were validated. For the backbone (i.e., feature extractor f and classifier h) used in each dataset, we introduce an additional projection head g comprising two linear layers and one normalization layer. This head transforms the features output by the penultimate layer of the backbone network into a low-dimensional space of dimension 128 to obtain a more compact embedding. The datasets used in this paper are summarized as follows:

CIFAR-10 [40]. The basic information of this dataset is shown in Table 1. Since all labels in the dataset are accurate (clean), we consider two types of synthetic noisy labels: symmetric and asymmetric. By artificially synthesizing noisy labels, we can simulate scenarios such as label errors or confusion in the real world, thereby evaluating and improving the robustness of NLL methods in noisy environments. Symmetric noise randomly flips the labels of r% (i.e., the noise rate) of the samples from each class to all other classes with a uniform distribution. Asymmetric noise simulates label confusion scenarios, mainly by flipping r% of the truck class samples to automobile, r% of the bird class samples to airplane, interchanging samples between the cat and dog categories, etc. We considered four symmetric noise scenarios, where r% takes values of 20%, 50%, 80%, and 90%, as well as five asymmetric noise scenarios, where r% takes values of 10%, 20%, 30%, 40%, and 49%. To ensure a fair comparison with previous methods, we employed the PreAct ResNet-18 [43] as the backbone. Table 2 shows the experimental settings of the method in this paper. To illustrate the robustness of our approach, we employed nearly identical parameter configurations across all noise scenarios. Despite prior studies suggesting that the parameter $\lambda_u$ should vary depending on noise rates and types, we opted for a fixed value of $\lambda_u = 30$. The only exception occurred in low noise rate scenarios, such as 20%-sym. and 10% to 30%-asym., where we set $\lambda_u = 0$. This choice is intuitive, as lower noise rates call for weaker regularization on unlabeled samples. Additionally, the learning rate lr linearly decays to $2 \times 10^{-4}$ within the first 380 epochs and remains fixed thereafter. We decrease the $n$ used in $\mathrm{top}_i^n$ from 3 to 2 after 40 epochs.

CIFAR-100 [40]. The basic information of this dataset is also shown in Table 1. Following previous studies, we again consider both symmetric and asymmetric noisy labels in this dataset. The generation of symmetric noisy labels is consistent with CIFAR-10, while the generation of asymmetric noisy labels involves flipping r% of the samples from each category to the next similar category within its superclass. We considered r = 20, 50, 80, and 90 for symmetric noise scenarios and r = 10, 20, 30, and 40 for asymmetric scenarios. Table 2 shows the experimental settings in this paper. $\lambda_u$ is still fixed at 30, except for 10%-asym. and 90%-sym., where $\lambda_u = 0$ and 150, respectively. The adjustment of the learning rate lr is the same as for CIFAR-10, which further demonstrates the robustness of our method. The setting of $n$ is the same as for CIFAR-10.
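The symmetric noise protocol used for both CIFAR datasets can be illustrated with a short Python sketch. It flips a fraction `noise_rate` of the labels in every class to one of the other classes chosen uniformly at random, following the wording above ("to all other classes"). The function name, the per-class rounding, and the seeding scheme are assumptions for illustration; published benchmark code may differ in these details.

```python
import random


def add_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Return a copy of `labels` in which a fixed fraction per class is flipped.

    For each class, round(noise_rate * class_size) samples are picked without
    replacement and reassigned uniformly to one of the other classes.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    for c in range(num_classes):
        members = [i for i, y in enumerate(labels) if y == c]
        num_flip = round(noise_rate * len(members))
        others = [k for k in range(num_classes) if k != c]
        for i in rng.sample(members, num_flip):
            noisy[i] = rng.choice(others)
    return noisy
```

Because the flipped label is always drawn from the other classes, the realized noise rate equals the nominal one up to per-class rounding, for example exactly 20 flipped labels per class for 100 samples per class at r = 20%.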
Animal-10N [44]. This is a fine-grained real-world noisy dataset comprising 10 classes of animal images. The noise rate of this dataset is approximately 8%. The basic information is outlined in Table 1. VGG-19N [45] is utilized as the backbone. Table 2 shows the experimental settings in this paper. To illustrate the robustness of our approach, we employed nearly identical parameters to CIFAR-10 and set $\lambda_u$ to 0. The learning rate lr was reduced by factors of 10 and 100 after 80 and 140 epochs, respectively. The setting of $n$ was the same as for CIFAR-10. Additionally, to ensure a fair comparison with some co-teaching-based methods, we also present experimental results based on the 9-layer CNN [25,26]. The hyperparameter settings of this backbone are identical to those of VGG-19N, demonstrating that our approach is insensitive to model architecture.
Clothing1M [41]. This is also a real-world noisy dataset, with nearly 38.4% noisy labels, comprising 14 categories of clothing images. Table 1 summarizes this dataset. ResNet-50 [43] pretrained on ImageNet is the backbone. Table 2 shows the experimental settings in this paper. Following previous methods, we randomly sampled a class-balanced set of 64K images for training in each epoch; accordingly, during training we added the PLR loss to pre-train the projection head and feature extractor. Additionally, we performed model performance calibration using CE loss every 5 epochs. The hyperparameters were the same as for Animal-10N. The learning rate lr was reduced by a factor of 10 every 40 epochs. We decreased $n$ from 3 to 2 and then to 1 after 15 and 30 epochs, respectively, which is the same as in [7].

Experiments on Synthetic Noisy Datasets
This section illustrates the performance of BPT-PLR on CIFAR-10 and CIFAR-100 across various noise types and rates and compares our method with various SOTA methods from 2018 to 2024. All the results of our method are the means of two independent experiments.

Results on CIFAR-10
Following the validation methodology established in the NLL field, we demonstrate the robustness and generalization of our method on CIFAR-10 using synthetic symmetric and asymmetric noisy labels. Table 3 presents the comparison of our method and some SOTA methods on CIFAR-10 with various noise types and rates. To demonstrate our approach's robustness, we provide the average test accuracy of the last 10 epochs (denoted as the last) and the best test accuracy across all epochs (denoted as the best). The results reported are the means of two independent experiments. Firstly, as evident from the results of the standard CE method in Table 3, the DNN trained solely on CE loss was not able to withstand noisy labels, leading to performance degradation. Secondly, we list the results of certain representative NLL methods with outstanding performance from 2018 to 2022, such as co-teaching, DivideMix, ELR+, UNICON, Mixup, and PENCIL. To thoroughly illustrate the robustness of our method, we specifically compare it with recent NLL methods, including LongReMix, OT-Filter, DISC, ScanMix, C2MT, SLRLNL, RL, PLReMix, and HMW+. Table 3 shows that many SOTA methods achieve excellent and comparable performance under low noise rates, regardless of symmetric or asymmetric noise scenarios (e.g., from 20% to 50% symmetric noise, and 40% asymmetric noise). Nevertheless, our method still achieves optimal performance and significantly outperforms these methods. For instance, in the 20% symmetric/40% asymmetric noise scenario, our method surpasses UNICON, LongReMix, OT-Filter, DISC, ScanMix, C2MT, and PLReMix by margins of 1.0%/1.56%, 0.7%/0.96%, 1.0%/0.51%, 0.9%/1.06%, 1.0%/1.96%, 0.5%/2.7%, and 0.37%/0.55%, respectively. While HMW+ is an improvement based on the UNICON framework, its accuracy did not significantly improve compared with the source framework. In contrast, our method outperforms PLReMix by a noticeable margin. This clearly demonstrates the effectiveness of the two key steps introduced in our approach.

As the noise ratio increases, the superiority of our method becomes increasingly evident. For example, in scenarios with 90% symmetric/49% asymmetric noise, our method outperforms UNICON, LongReMix, OT-Filter, DISC, ScanMix, and PLReMix by margins of 3.27%/1.9%, 13.2%/5.2%, 3.5%/1.06%, 39%/16.9%, 3%/0.96%, and 2.1%/3.4%, respectively. These methods exhibit varying degrees of overfitting in the 49% asymmetric noise scenario, as indicated by the substantial gaps between the last and best results, such as 6.6% for LongReMix, 0.9% for OT-Filter, 3.7% for DISC, and 31% for PLReMix. In contrast, our method demonstrates excellent robustness, with a difference of only 0.17%. PLReMix uses flat NCE instead of non-flat NCE to design a flat PLR loss specifically for CIFAR-10 and CIFAR-100, aiming to improve the accuracy of the algorithm; it is hence referred to as Flat-PLReMix here. Our approach, however, employs only the original PLR loss across all datasets to demonstrate its robustness. Nevertheless, the performance of our method on CIFAR-10 still significantly surpasses that of Flat-PLReMix. This clearly demonstrates the effectiveness of the two key processes (i.e., BP-GMM and SSO-PLR) proposed in this paper. Furthermore, since many methods only provide results for the 40% asymmetric noise scenario, to thoroughly demonstrate the robustness of our method, we report the results of these methods in the 10% to 30% asymmetric noise scenarios based on publicly available code and compare them with our method. The experimental results further confirm the effectiveness of our method.

Finally, in Figure 6, we present the test accuracy curves of our method and some SOTA methods. It can be observed from the figures that our method maintains steady progress, demonstrating its effectiveness in resisting noisy labels as training progresses. Combining Figure 6 with Figures 3 and 5, it is evident that our proposed BP-GMM process divides the data into labeled subsets with sizes closer to the true clean rate. Within the labeled subset, there are more TP samples for each category, while the number of FP samples is relatively low. As a result, we achieve better test performance. Since the SSO-PLR technique proposed in this paper combines oversampling strategies with PLR loss to extract more information from unlabeled samples, it enables repeated learning on clean samples, accelerating convergence (i.e., the steeper test accuracy curves in Figure 6c,d) and enhancing the final test results.

Results on CIFAR-100
To further highlight the advantages of our method, we validate our approach on the CIFAR-100 dataset with varying synthetic noisy labels, following previous SOTA methods. Table 4 provides a comparison of our method with some SOTA methods on this dataset. Consistent with the experiments on CIFAR-10, we also present both the "last" and "best" results. Firstly, from the testing results of CE, co-teaching, Mixup, and PENCIL in scenarios where the noise ratio is greater than or equal to 30%, regardless of symmetric or asymmetric noise, it can be observed that as the number of classes increases, the impact of noisy labels on the model becomes more severe. DNNs trained with these methods almost lose discriminative ability and degenerate toward random guessing (i.e., the test accuracies of these methods are below 50%). Furthermore, recent SOTA methods such as DivideMix, UNICON, DISC, LongReMix, RL, C2MT, PLReMix, and HMW+ have achieved significant improvements on CIFAR-100 across varying noise scenarios. However, in most scenarios, our method still shows improvements over these methods. For instance, in low-noise-rate scenarios, specifically 20%-sym. and 20%-asym., our method leads by 1.55%/9.5%, 0%/0.84%, 0.1%/0.64%, 1.85%/-, 0.06%/−0.68%, 1.35%/1.2%, 0.9%/-, and 2.2%/2.4%, respectively. Notably, RL utilizes a deeper ResNet-34 for training on this set, and the difference between the PreAct ResNet-18 and ResNet-34 results for PENCIL shows that test performance improves as the model gets deeper. Nevertheless, our method still achieves comparable or slightly superior results to RL, indicating its advantage. Similar to the observations on CIFAR-10, as the noise ratio increases, our method continues to achieve optimal results (e.g., in the 50%-sym., 80%-sym., 10%-asym., and 30%-asym. scenarios) or near-optimal results (e.g., second only to ScanMix in the 90%-sym. scenario), except for the 40%-asym. scenario. In this case, our method lags behind RL, OT-Filter, DISC, etc., by nearly 2.2%.
Figure 6. Test accuracy curves, including those of our method across varying asymmetric noise rates. (c) The comparison of test accuracy between five SOTA methods (i.e., DivideMix, UNICON, LongReMix, PLReMix, and C2MT) and our method in the scenario of 80% symmetric noise. These methods were originally set to train for 300 epochs, while our method followed the parameter settings of PLReMix, which are set to 400 epochs. (d) The comparison of test accuracy between these SOTA methods and our method in the scenario of 30% asymmetric noise.
Our analysis suggests that the main issue lies in these methods adjusting hyperparameters dynamically based on noise types and ratios, while we maintain nearly consistent parameter settings across all experiments. Additionally, as the asymmetric noise ratio on CIFAR-100 rises to 40%, the numbers of clean and noisy samples per class become almost equal (300:200), posing a challenge to calculating a reliable negative set for PLR loss. This is one of our focal points for future research. Despite this, our method still achieves suboptimal performance compared with UNICON while far outperforming LongReMix, SLRLNL, HMW+, and others. Moreover, such extreme cases are rare in real-world scenarios, as most datasets have a large number of samples per class (greater than 1000), resulting in a significant gap between the numbers of clean and noisy samples per class, even with many categories and high noise ratios. Therefore, we can conclude that our method is suitable for most noisy scenarios with a large number of categories and demonstrates good robustness and classification performance.

Similarly, in Figure 7, we present the test accuracy curves of our method and some SOTA methods. It can be observed from the figures that our method maintains steady progress, demonstrating its effectiveness in resisting noisy labels as training progresses. From the results in Figure 5c,d, it can be seen that the sizes of the unlabeled subsets we partitioned (referred to as noisy label sets) are closer to the true noise rates. The analysis combining Figures 3 and 5 clearly shows that both the BP-GMM process and the SSO-PLR process remain effective on noisy datasets with a larger number of categories. Therefore, the test curve of our method is steeper and higher than the test accuracy curves of several SOTA methods shown in Figure 7c,d.

Figure 7. (c) The comparison of test accuracy between several SOTA methods and our method in the scenario of 80% symmetric noise. These methods were originally set to train for 300 epochs, while our method followed the parameter settings of PLReMix, which are set to 400 epochs. (d) The comparison of test accuracy between these SOTA methods and our method in the scenario of 30% asymmetric noise.

Experiments on Real-World Noisy Datasets
We have conducted extensive experiments on the CIFAR-10 and CIFAR-100 datasets, demonstrating the effectiveness of our method. In this section, we apply it to two real-world noisy datasets crawled from websites, Animal-10N and Clothing1M, to further validate its performance; the experimental analysis follows.

Results on Animal-10N
Since SOTA methods mainly employ two network architectures (i.e., the 9-layer CNN and VGG-19N) for Animal-10N evaluation, we provide the test results of the BPT-PLR method based on both networks in Table 5. Additionally, we report the results of LongReMix and PLReMix on this dataset using publicly available code. Because PLReMix uses the original PLR loss on real-world noisy datasets in its reference, it is denoted as N-Flat-PLReMix (non-flat PLReMix). From the table, it is evident that our method achieved the best performance across the two network architectures. Our method outperforms TCC-net and C2MT by 4.0% and 2.5%, respectively, on the 9-layer CNN, and surpasses OT-Filter, DISC, LongReMix, C2MT, SLRLNL, and HMW+ by at least 1% on VGG-19N. Although the best accuracy of PLReMix is close to ours (with a difference of approximately 0.3%), its last accuracy lagged further behind our method (with a difference of approximately 0.75%). Through the comparison experiments on Animal-10N, we further illustrated two advantages of our method: maintaining stable and excellent performance across various noise scenarios and being insensitive to model structures, thus being compatible with most DNNs. In Figure 8, we present the test accuracy curves of our method and some reproduced methods on this dataset. Similar to Figures 6 and 7, we still find that the test accuracy curve of our proposed method is steeper and higher than those of existing SOTA methods, which demonstrates its effectiveness in dealing with real-world fine-grained noisy datasets.

Results on Clothing1M
Table 6 presents the experimental results on the Clothing1M dataset. From the table, it can be observed that our method performs slightly worse than existing state-of-the-art methods. The core issue lies in adopting almost the same hyperparameter settings as those used on the Animal-10N dataset. Additionally, following the PLReMix approach, we randomly sample 64K data points for training in each epoch. Consequently, the BP-GMM process potentially faces a different training set in each epoch, diminishing the coherence of balanced partitioning. This inconsistency affects both the partition accuracy and the subsequent PLR loss computation, resulting in a slight decrease in performance. Furthermore, while PLReMix slightly outperforms our approach, this advantage stems from its selective use of flat PLR and non-flat PLR tailored to different datasets. In contrast, we employed the same non-flat PLR loss across all datasets. Despite this, we still achieved near-excellent performance, trailing the SOTA methods (i.e., PLReMix, OT-Filter, and C2MT) by only approximately 0.1-0.2%. Considering that the Clothing1M dataset contains 1 million training samples, this performance gap can be considered negligible. Furthermore, we still outperformed many recent methods such as UNICON, DISC, and SLRLNL. Therefore, our proposed method is applicable to large-scale noisy datasets.
BPT-PLR (Ours): last 88.02, best 88.28.

Ablation Study
In this section, we conduct an ablation analysis of the key modules proposed in this paper to demonstrate their efficacy. Compared with the original PLReMix method, this paper mainly introduces two key processes: BP-GMM and SSO-PLR. In BP-GMM, we combine balanced partitioning with a two-dimensional GMM and perform sample selection based on both label and semantic information. Therefore, in the ablation experiments, we regard the balanced partitioning module as a key module, abbreviated as BP. Similarly, in SSO-PLR, we treat the oversampling technique and the PLR loss as two modules, abbreviated as OS and PLR, respectively. We present the ablation results for these modules in Table 7. In the BP and OS columns, "✗" indicates that the corresponding module was not used in that experiment, and "✓" indicates that it was. The PLR column is slightly different: "✗" indicates that we used the original CRL loss for both labeled and unlabeled samples, meaning the reliable negative class set $O_i^t$ (Equation (14)) was not constructed; "✓" indicates the use of the non-flat PLR loss (which we use on all datasets, unlike PLReMix, where flat and non-flat PLR losses are selected according to the dataset type).

Table 7. Ablation studies of our method. The best accuracies are shown in bold, and we report last/best results. "✗" indicates that the module is not employed, while "✓" indicates the opposite. "BP" represents the balanced partitioning module, and "OS" represents the oversampling module. In the "PLR" column, "✗" indicates the use of the CRL loss; otherwise, the PLR loss described in this paper is employed. The average column reports the mean of the 80%-sym. and 40%-asym. results. Row #4 indicates the original PLReMix. Each result comes from one experiment.

Analyzing the results in Table 7, we draw the following conclusions:

The effect of each module. From Rows #1 to #4 in Table 7, it is evident that using each module individually (Row #2 for BP, Row #3 for OS, and Row #4 for PLR) improved the average testing accuracy compared with the original method (i.e., Row #1) but also increased the risk of model overfitting. For instance, in Rows #3 (80%-sym.) and #4 (40%-asym.), the last results lag significantly behind the best results, indicating that the DNNs overfit in the later stages of training. This suggests that while individual modules enhance the model's robustness, their stability still needs improvement.

The effect of combining BP and OS. Although using OS alone may lead to model overfitting, combining it with BP lets the two modules reinforce each other, significantly enhancing the model's robustness and consistently improving testing accuracy. Comparing Rows #1 and #5, we observe that in the two noise scenarios, the combination of BP and OS improved performance by 0.9%/0.89% and 6.4%/6.3%, respectively. Additionally, the average testing accuracy increased by 3.2%/3.6%. This underscores the necessity of using the BP and OS modules together. Furthermore, comparing the results of using both BP and OS (Row #5) with those of using BP or OS alone (Rows #2 or #3), we find that introducing OS benefits the BP operation, further enhancing the performance of the model.
The effect of combining BP and PLR. As with OS, using PLR alone can lead to overfitting in scenarios with noisy labels. However, the experiment in Row #6 demonstrates that combining PLR with BP overcomes this issue and consistently enhances the model's robustness. Comparing the results of Rows #1 and #6, it is evident that in the two noise scenarios, this combination improves performance by approximately 0.1%/0.1% and 7%/7% compared with the original method. Additionally, the average testing accuracy increased by 3.2%/3.6%. Furthermore, by comparing the results of Row #6 with Rows #2 or #4, we further confirm the necessity of combining BP and PLR.
The effect of combining OS and PLR. As in the analysis above, when OS and PLR are combined, the testing performance of DNNs is significantly improved compared with the original method. Comparing the results of Rows #1 and #7, the combination improves performance by approximately 0.8%/0.7% and 2.3%/2.2%, respectively. Compared with Rows #3 and #4, although the combination yields a negligible improvement in testing performance, it mitigates the overfitting caused by using either component alone, demonstrating the necessity of using OS and PLR together.
The effect of combining BP, OS, and PLR. Finally, we compare the results of using all three components introduced in this paper (i.e., the BPT-PLR framework, Row #8) with the best results among the other ablation experiments (i.e., Row #5). Our method overcomes the issues mentioned above: it applies to both asymmetric and symmetric noise, and it enables the model to remain robust and achieve optimal performance. Although our method performed 0.1% worse than using only BP and OS in the symmetric noise scenario, it outperformed the other configurations by approximately 0.8% in the asymmetric noise scenario, resulting in a better average than that of Row #5. This demonstrates the necessity of using all three components together.
These experiments analyze the impact of each component introduced in this paper on the model's testing performance in different noise scenarios. Through quantitative analysis, we found that the more components are introduced, the more stable the model's robustness becomes. When all components are used simultaneously, we obtain nearly optimal results, demonstrating the necessity of the proposed framework. Furthermore, Figure 9 presents the results of each ablation experiment as testing accuracy curves, providing a more visual comparison of the changes in accuracy. Clearly, the BPT-PLR framework (i.e., Row #8) maintains stable performance and achieves the best testing accuracies. It is also evident in Figure 9a,b that Rows #3 (using only OS) and #4 (using only PLR) begin to overfit to samples with noisy labels during the late stages of training, resulting in a dramatic decline in test performance. This explains the significant difference between the "last" and "best" results of these two configurations in Table 7 and further illustrates the necessity of using the three modules together in the BPT-PLR method.

Discussion
We validated the effectiveness of our proposed method through extensive experiments on four benchmark datasets. The comparative experiments shown in Tables 3 and 4 demonstrate the superior performance of our method on synthetic noise datasets, indicating its applicability to both fine-grained and coarse-grained noisy datasets. We illustrate the necessity of the proposed BP-GMM process in Figure 3, showing that it can improve the balance of labeled subsets after partitioning, increase the number of TP samples, and maintain or even reduce the number of FP samples. Additionally, we elaborated on the necessity of oversampling techniques in the SSL-based sample selection framework, as shown in Figure 5. In Tables 5 and 6, we provide the results of our approach on two real-world noisy datasets and compare them with several SOTA methods, further demonstrating the effectiveness of our framework. Moreover, we verified the robustness of our method to network structures and demonstrated its applicability to most DNN models, showcasing its broad utility. In Figures 6-8, we compare the test accuracy curves of our method with those of several SOTA methods, revealing not only a faster convergence rate (steeper curve) but also higher test performance. Taken together, Figures 3 and 5 further emphasize the necessity of the two proposed key processes. Finally, through extensive ablation experiments, we affirmed the effectiveness of the core modules used in these key processes. The ablation experiments show that although individual modules may not consistently improve the model's test performance, when used together they mutually enhance and stabilize it, underscoring the indispensability of these key modules.
Naturally, BPT-PLR has some limitations. For instance, as shown in Table 4, our method does not outperform existing methods in handling 40% asymmetric noise and 90% symmetric noise. Although such extreme noise scenarios are uncommon in practice, we still consider them a focus for future research. Additionally, while we significantly outperformed existing methods on Animal-10N, we only approximately matched SOTA methods on Clothing1M without surpassing them. This indicates that while our method is robust across various noisy datasets, its level of robustness is not consistently stable, which is also a point of consideration for future work. Finally, we plan to extend the two key processes proposed in this paper to the Out-of-Distribution (OOD) sample detection domain.

Figure 1. Overall framework of the BPT-PLR. (a) Overall process: The training data are fed into two networks, A and B, for loss computation. In each network, the extracted features are used together with the class prototypes to calculate the prototype loss, while the output predictions are used together with the observed labels to compute the classification loss. Subsequently, the BP-GMM process (i.e., Section 3.1) utilizes the semantic and label information carried by these two losses to balance the partitioning of the training data. In this process, the labeled subset X and the unlabeled subset U partitioned by network A are used by network B for the SSO-PLR process (i.e., Section 3.2), and vice versa. (b) Network structure: Each network consists of a feature extractor

illustrates the BP-GMM process. dataset D into a labeled set $X^{(m)}$ and an unlabeled set

Figure 2. An overview of the BP-GMM process. First, similar to PLReMix, a 2D GMM is constructed based on the classification and prototype losses of all samples to estimate the posterior probability that each sample has a clean label. Unlike PLReMix, to reduce the number of false positive samples in the labeled set and achieve a more balanced category distribution, class-level balanced selection is conducted based on the estimated probabilities to ensure that the sample quantities of each class in the labeled subset X are close, ultimately resulting in the labeled subset X and the unlabeled subset U.
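The class-level balanced selection step can be illustrated with a small sketch. It assumes the clean-label posteriors have already been produced by a 2D GMM fitted elsewhere, and it uses a hypothetical quota rule (the mean number of above-threshold samples per class); the function name, threshold, and quota are illustrative assumptions, not the selection rule defined in this paper.

```python
from collections import defaultdict


def balanced_partition(clean_probs, labels, threshold=0.5):
    """Split sample indices into (labeled, unlabeled) with near-equal class counts.

    clean_probs[i]: posterior probability that sample i has a clean label
                    (assumed to come from a 2D GMM fitted elsewhere).
    labels[i]:      observed (possibly noisy) class label of sample i.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Hypothetical quota: mean number of above-threshold samples per class.
    confident = sum(p > threshold for p in clean_probs)
    quota = max(1, round(confident / len(by_class)))

    labeled = []
    for indices in by_class.values():
        # Rank the samples of this class from most to least likely clean.
        ranked = sorted(indices, key=lambda i: clean_probs[i], reverse=True)
        labeled.extend(ranked[:quota])

    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(labels)) if i not in labeled_set]
    return sorted(labeled), unlabeled


# Toy example: class 0 is fitted well, class 1 poorly.
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
noisy_labels = [0, 0, 0, 0, 1, 1]
labeled_idx, unlabeled_idx = balanced_partition(probs, noisy_labels)
print(labeled_idx, unlabeled_idx)
```

In this toy example, a plain threshold of 0.5 would place only class-0 samples in the labeled subset, whereas the per-class quota keeps both classes represented.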

) of Section 3.3. Here, we assume that all class prototypes $\{Q_c^{(m)}\}_{c=1}^{k}$ for the current epoch have been obtained. Consequently, the PCE of instance $x_i$ is denoted as

number of true positive samples in each class. Although UNICON also addresses class imbalance, resulting in balanced samples after selection, its use of Jensen-Shannon divergence (JSD) to partition based solely on class information neglects semantic information.

Figure 3. Efficiency comparison of sample selection methods using the CIFAR-10 dataset at 100 epochs with different proportions of noisy labels. (a) Comparison using the CIFAR-10 dataset with 50% symmetric noisy labels. TP refers to clean samples correctly selected into the labeled set, while FP refers to noisy samples mistakenly included in the labeled set. L, P, U, and O represent LongReMix [8], PLReMix [7], UNICON [28], and our method, respectively. (b) Comparison using the CIFAR-10 dataset with 40% asymmetric noisy labels. (c) Comparison using the CIFAR-10 dataset with 90% symmetric noisy labels. (d) Comparison using the CIFAR-10 dataset with 49% symmetric noisy labels.

, that it uses are derived from the partition results of the (1−m)-th network. As illustrated in Figure 4, we employ an SSL framework similar to


Figure 4. An overview of the SSO-PLR process. First, we sample a mini-batch of size b from the labeled dataset X and set the sampling count t to 1. Then, we sample a mini-batch of the same size from the unlabeled dataset U. Pseudo-labels are generated for both batches, followed by a Mixup operation to enhance the model's generalization performance. Next, we compute the PLR loss and SSL loss and perform backpropagation. In this process, different from PLReMix, we introduce an oversampling mechanism to fully exploit feature information from unlabeled samples during the SSL process. If the sampling count t for the labeled dataset has not reached the maximum sampling times of the original dataset, (|X| + |U|)/b, then even if we have completed sampling the entire labeled dataset, we resample the labeled subset and increment t, continuing training to learn the remaining samples in the unlabeled subset. Training stops for the current epoch e only when t ≥ (|X| + |U|)/b.
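The oversampling schedule described in this caption can be sketched as a batch generator. This is one possible reading, under the assumptions that t counts mini-batch iterations, that the iteration budget is rounded up to ceil((|X| + |U|)/b), and that a subset is reshuffled each time it is exhausted; the function and variable names are hypothetical.

```python
import math
import random


def oversampled_batches(labeled, unlabeled, batch_size, seed=0):
    """Yield (labeled_batch, unlabeled_batch) pairs for one epoch.

    The epoch runs for ceil((|X| + |U|) / b) iterations; whenever a subset is
    exhausted, it is reshuffled and sampled again (oversampling).
    """
    if not labeled or not unlabeled:
        raise ValueError("both subsets must be non-empty")
    rng = random.Random(seed)
    total_steps = math.ceil((len(labeled) + len(unlabeled)) / batch_size)

    def cycling(pool):
        # Endless stream of mini-batches; reshuffle on every pass over the pool.
        while True:
            order = pool[:]
            rng.shuffle(order)
            for start in range(0, len(order), batch_size):
                yield order[start:start + batch_size]

    labeled_stream = cycling(labeled)
    unlabeled_stream = cycling(unlabeled)
    for _ in range(total_steps):
        yield next(labeled_stream), next(unlabeled_stream)


# Toy example: a small labeled subset is revisited until the budget is spent.
X = list(range(4))
U = list(range(100, 112))
batches = list(oversampled_batches(X, U, batch_size=4))
print(len(batches))
```

With |X| = 4, |U| = 12, and b = 4, the labeled subset is resampled on every iteration while the unlabeled subset is covered in full, which matches the intent of exploiting all unlabeled samples within an epoch.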

$\mathbf{y}_i$ is the one-hot representation of $y_i$, and $sp(\cdot)$ is a sharpen function used in previous works. The calculation of this sharpen function is as follows:
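A minimal sketch of such a sharpen function follows, assuming it is the temperature sharpening used in MixMatch and DivideMix, $sp(p, T)_c = p_c^{1/T} / \sum_j p_j^{1/T}$. The temperature value and the names below are illustrative and are not taken from this paper.

```python
def sharpen(probs, temperature=0.5):
    """Temperature sharpening: raise each probability to 1/T and renormalize."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


# With T < 1 the dominant class gains mass and the others lose it.
sharpened = sharpen([0.6, 0.3, 0.1], temperature=0.5)
print([round(p, 4) for p in sharpened])
```

Lower temperatures push the distribution toward a one-hot vector, while T = 1 leaves it unchanged.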


Algorithm 1: Training process pseudo-code representation
Input: two networks m = 0 and m = 1; the warm-up epochs E_w; the total training epochs E_tot; batch size b; learning rate lr; thresholds τ_s and υ; epoch counter e = 0; sampling counter t = 0;
while e < E_tot do:
    if e < E_w:
        // enable Equation (2) only in the presence of asymmetric noisy labels
        pretrain the two networks on the whole dataset D using Equations (1) and (2);
        if e = (E_w − 1):
            // the same as PLReMix
            initialize class prototypes $\{Q_c^{(m)}\}_{c=1}^{k}$ for each network using Equation (18);
        end if
    else:
        re-initialize the sampling counter t = 0;

Figure 6. The comparison of test accuracy (%) curves between some SOTA methods and our method using CIFAR-100. (a) Test accuracy curves of our method across varying symmetric noise rates. (b) Test accuracy curves of our method across varying asymmetric noise rates. (c) The comparison of test accuracy between five SOTA methods (i.e., DivideMix, UNICON, LongReMix, PLReMix, and C2MT) and our method in the scenario of 80% symmetric noise. These methods were originally set to train for 300 epochs, while our method followed the parameter settings of PLReMix, which specify 400 epochs. (d) The comparison of test accuracy between these SOTA methods and our method in the scenario of 30% asymmetric noise.


Figure 7. The comparison of test accuracy (%) curves between some SOTA methods and our method using CIFAR-100. (a) Test accuracy curves of our method across varying symmetric noise rates. (b) Test accuracy curves of our method across varying asymmetric noise rates. (c) The comparison of test accuracy between five SOTA methods (i.e., DivideMix, UNICON, LongReMix, PLReMix, and C2MT) and our method in the scenario of 80% symmetric noise. These methods were originally set to train for 300 epochs, while our method followed the parameter settings of PLReMix, which specify 400 epochs. (d) The comparison of test accuracy between these SOTA methods and our method in the scenario of 30% asymmetric noise.


Figure 8. The comparison of test accuracy (%) curves between reproduced methods and our method on Animal-10N.


Figure 9. Comparative ablation experiments for our method on CIFAR-10 with synthetic noisy labels. (a) Test accuracy (%) comparisons of different module combinations on CIFAR-10 with 80% symmetric noise. "Rows#i" in the figure refers to the i-th row in Table 7. (b) Test accuracy (%) comparisons of different module combinations on CIFAR-10 with 40% asymmetric noise.


Table 1. Overview of the datasets.

Table 2. The experimental settings of our method.

Table 3. The comparison of test accuracies (%) using CIFAR-10 across various noisy scenarios. The best accuracies are shown in bold. Underlines indicate reproduced results. "†" denotes that the backbone is ResNet-32.

Table 4. The comparison of test accuracies (%) on CIFAR-100 across various noisy scenarios. The best accuracies are shown in bold. Underlines indicate reproduced results. "†" denotes that the backbone is ResNet-34. "x/x" means the last/best accuracies.

Table 5. A comparison of test accuracies (%) on Animal-10N. The best accuracies are shown in bold. Underlines indicate reproduced results. "x/x" means the last/best accuracies. "†" denotes that the backbone is ResNet-34.

Table 6. The comparison of test accuracies (%) on Clothing1M. "*" indicates the backbone is PreAct ResNet-18. Underlines indicate reproduced results. The top-3 results are shown in bold.